Matrices and Dataframes

StartR Workshop

Maik Bieleke, PhD

University of Konstanz

November 23, 2024

Matrices

What are matrices?

Matrices are combinations of vectors. Like vectors, they can only contain values of the same type (e.g., numeric or character). If you combine vectors of different types, they will be coerced into the same type.
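The coercion rule can be illustrated with a minimal sketch. The vector names num and chr are made up for this illustration, and cbind() is the column-binding function introduced in the following sections.

```r
# Combine a numeric and a character vector into one matrix
num <- c(1, 2, 3)
chr <- c("a", "b", "c")
mixed <- cbind(num, chr)

# Print the matrix: the former numbers are now stored as character strings
mixed

# Check the storage type of the matrix
typeof(mixed)
```

Because a matrix can hold only one type, the numbers are converted to text rather than the text being converted to numbers.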

There are two main ways to create matrices in R:

  • from existing vectors with cbind() and rbind()
  • from scratch with matrix()

Matrices from existing vectors: cbind()

# Create vectors
x <- c(10, 11, 12, 13, 14, 15)
y <- c(20, 21, 22, 23, 24, 25)
z <- c(30, 31, 32, 33, 34, 35)

The cbind() function combines vectors into a matrix by binding them together by column.

cbind(x, y, z)
      x  y  z
[1,] 10 20 30
[2,] 11 21 31
[3,] 12 22 32
[4,] 13 23 33
[5,] 14 24 34
[6,] 15 25 35

Note that the vector names are used as column names.

Matrices from existing vectors: rbind()

# Create vectors
x <- c(10, 11, 12, 13, 14, 15)
y <- c(20, 21, 22, 23, 24, 25)
z <- c(30, 31, 32, 33, 34, 35)

The rbind() function combines vectors into a matrix by binding them together by row.

rbind(x, y, z)
  [,1] [,2] [,3] [,4] [,5] [,6]
x   10   11   12   13   14   15
y   20   21   22   23   24   25
z   30   31   32   33   34   35

Note that the vector names are used as row names.
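Both functions also accept an existing matrix, so they can be used to append columns or rows. The following is a small sketch with shortened vectors; the object name m and the column name z are chosen only for illustration.

```r
# Start with a matrix made of two columns
x <- c(10, 11, 12)
y <- c(20, 21, 22)
m <- cbind(x, y)

# Append a third column; the argument name becomes the column name
m <- cbind(m, z = c(30, 31, 32))

# Append a fourth row; its length must match the number of columns
m <- rbind(m, c(13, 23, 33))
m
dim(m)
```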

Matrices from scratch: matrix()

The matrix() function is an explicit way to create matrices from scratch. It takes the following arguments:

  • data: a vector containing the data
  • nrow: the number of rows
  • ncol: the number of columns
  • byrow: logical value indicating whether the matrix should be filled by column (FALSE, default) or by row (TRUE)

Examples

# Create a vector with data
data <- 1:10

The matrix() function can transform the data into a variety of matrices.

# Create a matrix with 2 rows and 5 columns
matrix(data, nrow = 2, ncol = 5)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
# Now fill by row instead of by column
matrix(data, nrow = 2, ncol = 5, byrow = TRUE)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
# Create a matrix with 5 rows and 2 columns
matrix(data, nrow = 5, ncol = 2)
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
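When the length of the data is a multiple of the given dimension, one of nrow and ncol can be left out and R infers the other. A brief sketch, using the made-up object names values and m:

```r
# Ten values and two rows: the number of columns is inferred
values <- 1:10
m <- matrix(values, nrow = 2)

# Check the resulting dimensions (rows, then columns)
dim(m)
nrow(m)
ncol(m)
```

Spelling out both arguments, as in the examples above, is more explicit and may be easier to read.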

Indexing with matrices

  • Remember that vectors are indexed like this: x[i]

    # Create a vector with values
    vec <- 12:1
    vec
     [1] 12 11 10  9  8  7  6  5  4  3  2  1
    vec[3]
    [1] 10
  • Matrices are indexed like this: x[i, j]

    # Create a matrix
    mat <- matrix(vec, nrow = 4, ncol = 3)
    mat
         [,1] [,2] [,3]
    [1,]   12    8    4
    [2,]   11    7    3
    [3,]   10    6    2
    [4,]    9    5    1
    # Get the value in the 2nd row and 3rd column
    mat[2, 3]
    [1] 3

Examples

mat
     [,1] [,2] [,3]
[1,]   12    8    4
[2,]   11    7    3
[3,]   10    6    2
[4,]    9    5    1

Use vectors to index multiple rows and/or columns.

  • Getting multiple row values

    # 2nd and 4th value in 1st column
    mat[c(2, 4), 1]
    [1] 11  9
  • Getting multiple column values

    # 2nd to 3rd value in 4th row
    mat[4, 2:3]
    [1] 5 1

Leave the row and/or column index empty to get all values.

  • Getting all row values

    # all row values in 1st column
    mat[, 1]
    [1] 12 11 10  9
  • Getting all column values

    # all column values in 4th row
    mat[4, ]
    [1] 9 5 1
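Two further indexing features are worth a quick sketch: negative indices, which exclude rows or columns, and the drop argument, which controls whether a single row or column is simplified to a vector. The object names below are chosen only for this illustration.

```r
# Same matrix as in the examples above
mat <- matrix(12:1, nrow = 4, ncol = 3)

# Negative indices exclude rows or columns instead of selecting them
without_row1 <- mat[-1, ]
without_row1

# A single column is returned as a plain vector by default ...
col_vec <- mat[, 1]
col_vec

# ... but drop = FALSE keeps it as a matrix with one column
col_mat <- mat[, 1, drop = FALSE]
dim(col_mat)
```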

Exercise ✏️

Photo courtesy of @kolbymilton

  1. Recall that letters is a vector of the alphabet. Create a matrix m1 with 3 rows and 5 columns using the first 15 letters of the alphabet.

    Solution
    m1 <- matrix(letters[1:15], nrow = 3, ncol = 5)
  2. Use cbind() to attach a sixth column with the letters p, q, and r.

    Solution
    m1 <- cbind(m1, c("p", "q", "r"))
  3. Extract the 2nd to 4th columns and assign them to a new matrix m2.

    Solution
    m2 <- m1[, 2:4]
  4. Create a new matrix m3 by removing the 3rd row of m2.

    Solution
    m3 <- m2[-3, ]

Dataframes

What are dataframes?

Dataframes are similar to matrices, but they can contain different types of data (e.g., numeric and character). Because of this flexibility, they are commonly used to store data in R.

Dataframes are often imported from external sources (e.g., Excel, SPSS, or CSV files). However, we can also create dataframes from scratch with the data.frame() function.

data <- data.frame("id" = c(1, 2, 3, 4, 5),
                   "sex" = c("m", "m", "m", "f", "f"),
                   "age" = c(99, 46, 23, 54, 23))
data
  id sex age
1  1   m  99
2  2   m  46
3  3   m  23
4  4   f  54
5  5   f  23
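To see that the columns really have different types, the structure of a dataframe can be inspected. This is a minimal sketch using the base R functions str(), dim(), and sapply(); the exact printed output depends on the R version.

```r
# Recreate the example dataframe
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   sex = c("m", "m", "m", "f", "f"),
                   age = c(99, 46, 23, 54, 23))

# Compact overview: one line per column with its type and first values
str(data)

# Number of rows and columns
dim(data)

# Class of every column
sapply(data, class)
```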

Selecting rows

We can select rows based on their numerical index (slicing).

  • Numeric indexing (slicing)

    # Get the 2nd and 3rd row
    data[2:3, ]
      id sex age
    2  2   m  46
    3  3   m  23
  • Using dplyr::slice()

    # Get the 2nd and 3rd row
    dplyr::slice(data, 2:3)
      id sex age
    1  2   m  46
    2  3   m  23

We can select rows based on a logical condition (filtering).

  • Logical indexing

    # Get rows for females
    data[data$sex == "f", ]
      id sex age
    4  4   f  54
    5  5   f  23
  • Using dplyr::filter()

    # Get rows for females
    dplyr::filter(data, sex == "f")
      id sex age
    1  4   f  54
    2  5   f  23
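Logical conditions can be combined with & (and) and | (or). The sketch below uses base R indexing only; the object names older_males and female_or_young are made up for illustration.

```r
# Recreate the example dataframe
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   sex = c("m", "m", "m", "f", "f"),
                   age = c(99, 46, 23, 54, 23))

# Rows for males older than 30 (both conditions must hold)
older_males <- data[data$sex == "m" & data$age > 30, ]
older_males

# Rows for females or anyone younger than 30 (one condition is enough)
female_or_young <- data[data$sex == "f" | data$age < 30, ]
female_or_young
```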

Selecting columns

  • Numeric indexing

    # Get the 2nd and 3rd column
    data[, 2:3]
      sex age
    1   m  99
    2   m  46
    3   m  23
    4   f  54
    5   f  23
  • Using dplyr::select()

    # Get the 2nd and 3rd column
    dplyr::select(data, 2:3)
      sex age
    1   m  99
    2   m  46
    3   m  23
    4   f  54
    5   f  23
  • Indexing by name

    # Get the "sex" and "age" columns
    data[, c("sex", "age")]
      sex age
    1   m  99
    2   m  46
    3   m  23
    4   f  54
    5   f  23
  • Using dplyr::select()

    # Get the "sex" and "age" columns
    dplyr::select(data, sex, age)
      sex age
    1   m  99
    2   m  46
    3   m  23
    4   f  54
    5   f  23

The $ operator

Retrieving a single column of a dataframe is so common that R provides a dedicated shortcut for this task: the $ operator.

data
  id sex age
1  1   m  99
2  2   m  46
3  3   m  23
4  4   f  54
5  5   f  23
# Get the "age" column
data$age
[1] 99 46 23 54 23
# Get the "sex" column
data$sex
[1] "m" "m" "m" "f" "f"

Note that the $ operator returns the variable as a vector.
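For comparison, here is a short sketch of two related base R forms: double brackets, which also return a vector, and single brackets with a name, which return a one-column dataframe. The variable col_name is hypothetical and only shows that a column name can be stored in a variable.

```r
# Recreate the example dataframe
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   sex = c("m", "m", "m", "f", "f"),
                   age = c(99, 46, 23, 54, 23))

# $ and [[ ]] both return the column as a vector
age_dollar <- data$age
col_name <- "age"
age_brackets <- data[[col_name]]

# Single brackets with a name return a dataframe with one column
age_df <- data["age"]
class(age_df)
```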

Adding columns

The $ operator is a simple way to add new columns to a dataframe.

# Adding a new character and a new numeric column
data$condition <- c("con", "con", "exp", "con", "exp")
data$score <- c(5.5, 2.3, 4.7, 6.7, 3.0)
data
  id sex age condition score
1  1   m  99       con   5.5
2  2   m  46       con   2.3
3  3   m  23       exp   4.7
4  4   f  54       con   6.7
5  5   f  23       exp   3.0

An alternative is the dplyr::mutate() function.

# Adding a new character and a new numeric column
data <- dplyr::mutate(data, 
                      time = c("t1", "t2", "t1", "t2", "t2"),
                      weight = c(70, 80, 65, 60, 75))
data
  id sex age condition score time weight
1  1   m  99       con   5.5   t1     70
2  2   m  46       con   2.3   t2     80
3  3   m  23       exp   4.7   t1     65
4  4   f  54       con   6.7   t2     60
5  5   f  23       exp   3.0   t2     75
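New columns are often computed from existing ones rather than typed in by hand. The following base R sketch centers age around its mean; the column name age_centered is made up for illustration, and the commented line shows what the dplyr::mutate() equivalent would look like.

```r
# Recreate the example dataframe
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   sex = c("m", "m", "m", "f", "f"),
                   age = c(99, 46, 23, 54, 23))

# Compute a new column from an existing one
data$age_centered <- data$age - mean(data$age)
data

# Equivalent with dplyr (requires the dplyr package):
# data <- dplyr::mutate(data, age_centered = age - mean(age))
```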

Renaming columns

We have already used the names of columns to select them. To see all names of a dataframe, use the names() function.

# Get the names of the dataframe
names(data)
[1] "id"        "sex"       "age"       "condition" "score"     "time"     
[7] "weight"   

The names can be changed by assigning new values.

# Rename the 4th column
names(data)[4] <- "group"
names(data)
[1] "id"     "sex"    "age"    "group"  "score"  "time"   "weight"

An alternative is the dplyr::rename() function.

# Rename the "sex" column to "gender"
data <- dplyr::rename(data, gender = sex)
names(data)
[1] "id"     "gender" "age"    "group"  "score"  "time"   "weight"
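Renaming by position can break when the column order changes. A base R alternative is to look the column up by its current name, as in this sketch with a small dataframe invented for the illustration:

```r
# A small dataframe for illustration
data <- data.frame(id = c(1, 2, 3),
                   sex = c("m", "f", "f"),
                   condition = c("con", "exp", "con"))

# Find the column called "condition" and give it a new name
names(data)[names(data) == "condition"] <- "group"
names(data)
```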

Exercise ✏️

Photo courtesy of @siora18

  1. Create a dataframe demo based on the table below.

    Solution
    demo <- data.frame(
      name = c("Alice", "Bob", "Charlie", "David", "Eva"),
      weight = c(165, 175, 180, NA, 160))
  2. Correct the name of the second column to height.

    Solution
    names(demo)[2] <- "height" 
    # alternative: demo <- dplyr::rename(demo, height = weight)
  3. Using the $ operator, convert height from cm to m.

    Solution
    demo$height <- demo$height / 100
    # alternative: demo <- dplyr::mutate(demo, height = height / 100)
  4. Compute the average height.

    Solution
    height_mean <- mean(demo$height, na.rm = TRUE)
  5. Select rows with above-average height.

    Solution
    # which() skips the row with the missing height (NA)
    demo[which(demo$height > height_mean), ]

name     weight
Alice    165
Bob      175
Charlie  180
David
Eva      160

Factors

What are factors?

Dataframes often contain factors that are used to represent categorical variables (e.g., sex, education level, blood type).

A factor can contain only predefined values (called levels) with unique labels. Factors are created using the factor() function.

# Creating a character vector
sex <- c("m", "m", "m", "f", "f")
sex
[1] "m" "m" "m" "f" "f"
# Converting to factor with two levels labelled "f" and "m"
factor(sex)
[1] m m m f f
Levels: f m

Factors are very similar to character vectors, but they are treated differently in many statistical analyses and data visualizations.
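The difference from a character vector becomes visible when we look inside a factor. The sketch below uses only base R; sex_f is a name chosen for illustration.

```r
# Character vector and the corresponding factor
sex <- c("m", "m", "m", "f", "f")
sex_f <- factor(sex)

# The predefined values of the factor
levels(sex_f)

# Internally, each value is stored as the position of its level
as.integer(sex_f)

# summary() counts the observations per level for a factor,
# but only reports length and type for a character vector
summary(sex_f)
summary(sex)
```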

Change level order

We can change the order of the levels by explicitly specifying them in the factor() function.

# Create a character vector
sex <- c("m", "m", "m", "f", "f")

# Order: "f", "m"
factor(sex, levels = c("f", "m"))
[1] m m m f f
Levels: f m
# Order: "m", "f"
factor(sex, levels = c("m", "f"))
[1] m m m f f
Levels: m f

This can be useful for reordering the levels in an analysis or plot (e.g., to change the order of the bars for males and females in a barplot).
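The effect of the level order can be checked with table(), which reports counts in the order of the levels. A minimal sketch, with object names chosen for illustration:

```r
sex <- c("m", "m", "m", "f", "f")

# Default order: levels are sorted alphabetically
tab_default <- table(factor(sex))
tab_default

# Custom order: "m" now comes first
tab_custom <- table(factor(sex, levels = c("m", "f")))
tab_custom
```

Plotting functions that work on factors typically follow the same order.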

Renaming factor levels

We can also rename the levels of a factor by specifying the new names in the labels argument of the factor() function. The labels are matched to the levels by position.

# Create a character vector
sex <- c("m", "m", "m", "f", "f")

# Ordinary factor
factor(sex, levels = c("f", "m"))
[1] m m m f f
Levels: f m
# Renamed factor
factor(sex, levels = c("m", "f"), labels = c("male", "female"))
[1] male   male   male   female female
Levels: male female

This can be useful for relabelling the levels in an analysis or plot.
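If a factor already exists, its levels can also be renamed afterwards by assigning to levels(). This is a sketch of that base R alternative; the new names must be given in the same order as the current levels.

```r
# An existing factor with the default levels "f" and "m"
sex_f <- factor(c("m", "m", "m", "f", "f"))
levels(sex_f)

# Assign new names in the order of the current levels
levels(sex_f) <- c("female", "male")
sex_f
```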